feat(csv): split GeoJSON and PMTiles into low-priority RQ jobs #408

bolinocroustibat wants to merge 7 commits into geojson-from-db
Conversation
```python
import logging
import os
import tempfile
from pathlib import Path

import aiohttp

log = logging.getLogger(__name__)


async def download_url_to_tempfile(url: str, suffix: str = "") -> Path:
    """Download a URL to a named temporary file and return its path."""
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            resp.raise_for_status()
            body = await resp.read()
    fd, raw_path = tempfile.mkstemp(suffix=suffix)
    path = Path(raw_path)
    try:
        os.write(fd, body)
    finally:
        os.close(fd)
    log.debug(f"Downloaded {url} to {path} ({len(body)} bytes)")
    return path
```
Is it possible to stream to a file instead of loading the whole file into RAM before writing it to disk?
It's possible; we are already doing it in utils/files.py -> download_resource. We could use a shared function for the two.
Refactored utils/files.py to use streaming HTTP for download_url_to_tempfile, and also relocated it and unified it with download_resource: 357dfb0
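The chunked-write pattern behind that refactor can be sketched with the stdlib alone. Here `io.BytesIO` stands in for the HTTP response stream, and `CHUNK_SIZE` and the `max_bytes` cap are illustrative assumptions mirroring the PR's `HTTP_DOWNLOAD_CHUNK_SIZE` and total-bytes cap; the actual code streams via aiohttp rather than a blocking reader:

```python
import io
import os
import tempfile
from pathlib import Path

CHUNK_SIZE = 8192  # hypothetical stand-in for the PR's HTTP_DOWNLOAD_CHUNK_SIZE


def stream_to_tempfile(stream, suffix: str = "", max_bytes=None) -> Path:
    """Copy a stream to a named temp file chunk by chunk, so the whole
    body is never held in RAM."""
    fd, raw_path = tempfile.mkstemp(suffix=suffix)
    path = Path(raw_path)
    total = 0
    ok = False
    try:
        while chunk := stream.read(CHUNK_SIZE):
            total += len(chunk)
            if max_bytes is not None and total > max_bytes:
                raise IOError(f"stream exceeded {max_bytes} bytes")
            os.write(fd, chunk)
        ok = True
    finally:
        os.close(fd)
        if not ok:
            path.unlink(missing_ok=True)  # do not leave partial files behind

    return path


path = stream_to_tempfile(io.BytesIO(b"x" * 100_000), suffix=".csv")
```

With aiohttp the read loop would instead be `async for chunk in resp.content.iter_chunked(CHUNK_SIZE)`, but the write/cap/cleanup structure stays the same.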
```python
record = await Check.get_by_id(check_id, with_deleted=True)
if not record:
    log.error(f"task_csv_to_geojson: check {check_id} not found")
    return
```
Check.get_by_id returns the check only if it is the latest one for this resource. If a new check arrives before this job starts, the job will not run. Not sure whether that's a problem in practice, but it's worth noting that if the low queue has trouble draining, we may never run this task (if new checks for this resource keep arriving before the queue catches up).
Geo export runs as chained tasks on the low queue; GeoJSON is generated from the PostgreSQL parsing table when CSV_TO_DB is enabled. RQ entrypoint uses context for exception-queue routing.

Made-with: Cursor
# Conflicts:
#	udata_hydra/analysis/geojson.py
Co-authored-by: Thibaud Ollagnier <ThibaudDauce@users.noreply.github.com>
…url_to_tempfile

- Add _http_get_to_temp_path (Path, total_bytes cap, shared session options)
- Factor download_resource and download_url_to_tempfile through it
- GeoJSON→PMTiles: headers, max_size_allowed, IOException handling
- download_file: BinaryIO + HTTP_DOWNLOAD_CHUNK_SIZE; remove_remainders uses Path.unlink

Made-with: Cursor
Closes #412 (and duplicate datagouv/data.gouv.fr#1991).
This is built on #404 where GeoJSON is generated from the PostgreSQL parsing table (streaming from the DB) instead of re-reading the CSV file.
Main changes:

- GeoJSON and PMTiles now run as separate RQ jobs on the `low` queue, so scheduling can treat them differently from CSV-to-DB ingest. When `CSV_TO_DB` is enabled, geo work no longer depends on the downloaded temp file still being on disk.
- The previous single `geojson.py` module is split into three smaller modules, so CSV export, PMTiles conversion, and the remaining orchestration are easier to follow and test.
- Removed the `CSV_TO_GEOJSON` config flag: since GeoJSON for CSVs now comes from the Postgres parsing table rather than from re-reading the file, this old flag was redundant. One switch (`DB_TO_GEOJSON`) is enough to turn that on or off.
- Refactored `utils/files.py` to use streaming HTTP for `download_url_to_tempfile`, and relocated and unified it with `download_resource`: 357dfb0
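The job split above (a GeoJSON-export job, then a PMTiles job that runs only after it, both on a low-priority queue) can be illustrated without a Redis backend. This is a toy stand-in for RQ's `depends_on` chaining, not the PR's actual wiring; the job names `csv_to_geojson` and `geojson_to_pmtiles` are assumptions:

```python
from collections import deque
from typing import Callable


class MiniQueue:
    """Toy FIFO queue illustrating RQ-style chained jobs: a job runs only
    after the job it depends on has finished."""

    def __init__(self) -> None:
        self._jobs: deque = deque()
        self.finished: list = []

    def enqueue(self, name: str, func: Callable, depends_on=None) -> str:
        self._jobs.append((name, func, depends_on))
        return name

    def work(self) -> None:
        while self._jobs:
            name, func, dep = self._jobs.popleft()
            if dep is not None and dep not in self.finished:
                self._jobs.append((name, func, dep))  # dependency not done: requeue
                continue
            func()
            self.finished.append(name)


low = MiniQueue()
g = low.enqueue("csv_to_geojson", lambda: print("export GeoJSON from parsing table"))
low.enqueue("geojson_to_pmtiles", lambda: print("convert GeoJSON to PMTiles"), depends_on=g)
low.work()
```

With real RQ this maps to `job = low_queue.enqueue(csv_to_geojson, ...)` followed by `low_queue.enqueue(geojson_to_pmtiles, ..., depends_on=job)`, and the worker enforces the ordering.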
Suggested for later, separate PRs:

- `task_*` RQ entrypoint names across the codebase
- A new name for `geojson.py` reflecting what that file still owns now that CSV GeoJSON and PMTiles live in their own modules (what about `geo_analysis.py`?)